
2014-09-09 A critique of SNARE

Abstract

Shuang Hao et al. [1] present a spam filter scoring system that is a linear combination of 13 simpler filters, with the coefficients determined by linear regression on training data.
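To make "linear combination with regression-fitted coefficients" concrete, here is a minimal sketch in plain Python. It is not the paper's code: the function names (fit_linear, score), the two toy features, and the training rows are all made up for illustration, and it uses ordinary least squares on 2 features rather than the paper's 13.

```python
def fit_linear(rows, labels):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y.

    Assumes the feature columns are linearly independent; otherwise the
    elimination below divides by zero.
    """
    # Prepend a constant 1.0 to each row so w[0] acts as the intercept.
    X = [[1.0] + [float(v) for v in r] for r in rows]
    n = len(X[0])
    # Build the augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(n)]
        + [sum(x[i] * y for x, y in zip(X, labels))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def score(weights, features):
    """Linear combination: intercept plus one weighted term per sub-filter."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))


# Hypothetical training data: (distance in thousands of miles, recipient count),
# labelled 1.0 for spam and 0.0 for ham.
training_rows = [(0.2, 1), (0.5, 1), (4.0, 20), (6.5, 35), (1.0, 2), (5.0, 50)]
training_labels = [0.0, 0.0, 1.0, 1.0, 0.0, 1.0]

weights = fit_linear(training_rows, training_labels)
new_message_score = score(weights, (3.0, 10))
```

A receiver would then compare new_message_score against a threshold chosen to balance false positives against missed spam.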

Critique

Overall, I think this is a nice paper with a few flaws. I especially like the use of the geodesic distance between sender and receiver as part of the filter.

I think the introduction mischaracterizes the state of spam filters in 2009. The reputation of a "sender" was no longer just the reputation of the sender's IP address; it also included the reputation of various domain names associated with the sender - the domain names in the HELO and MAIL FROM commands, and the domain name in the in-addr.arpa reverse DNS zone for the IP address. SURBL and URIBL popularized the use of those name-based reputations, and in early 2010 Spamhaus created the DBL, which is an equivalent list.
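As a rough illustration of what "the sender" means once names are included, the sketch below collects the lookup keys a receiver could check for one SMTP session. The function names and the sample values are hypothetical; only the in-addr.arpa naming convention and the standard library's ipaddress module are real.

```python
import ipaddress


def mail_from_domain(address):
    """Return the domain part of a MAIL FROM address, or "" for the null sender."""
    addr = address.strip().strip("<>")
    if "@" not in addr:
        return ""
    return addr.rpartition("@")[2].lower()


def reverse_dns_name(ip):
    """Name to query for the PTR record, e.g. 25.2.0.192.in-addr.arpa."""
    return ipaddress.ip_address(ip).reverse_pointer


def sender_identities(ip, helo, mail_from):
    """Collect every identity that can carry its own reputation."""
    return {
        "ip": ip,
        "helo_domain": helo.lower().rstrip("."),
        "mail_from_domain": mail_from_domain(mail_from),
        # The PTR lookup itself needs a DNS query; this is only the name to ask for.
        "ptr_query_name": reverse_dns_name(ip),
    }


# Hypothetical session values, using documentation-reserved address space.
identities = sender_identities(
    "192.0.2.25", "Mail.Example.COM", "<bounce@news.example.net>"
)
```

Each of the three names can then be checked against a name-based list, independently of the IP address.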

"Unfortunately, these blacklists can be both in-complete and slow-to-respond to new spammers." That was a common complaint about the SBL much earlier, but by 2004 Spamhaus had the XBL, which included the CBL as a principal component. The target of the CBL/XBL was and is infected machines - aka botnet members. It responds fairly quickly to newly infected machines, since the CBL has a widely distributed collection of spam traps.

Section 2.1 contains "This dynamism makes maintaining responsive IP blacklists a manual, tedious, and inaccurate process". Not really. Even in 2009, much of that maintenance work was automated. And from the point of view of the *user* of the DNSBL, it does not matter.
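From the user's side, a DNSBL check is just a DNS query, whatever process maintains the list behind it. The sketch below shows the usual convention: reverse the IPv4 octets, append the list's zone, and treat an answer in 127.0.0.0/8 as "listed". The zone dnsbl.example.org and both function names are placeholders, not a real list or API.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to query: reversed IPv4 octets followed by the zone."""
    addr = ipaddress.ip_address(ip)
    if addr.version != 4:
        raise ValueError("this sketch handles IPv4 only")
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone):
    """Return True if the zone answers with a 127.0.0.0/8 address for this IP.

    Performs a live DNS lookup; any resolution failure is treated as not listed.
    """
    try:
        answer = socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False
    return answer.startswith("127.")


# Placeholder zone; substitute the zone of whichever list you actually use.
query = dnsbl_query_name("192.0.2.99", "dnsbl.example.org")
```

Nothing in this lookup depends on whether the listing was added by hand or by an automated spam trap.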

Section 2.2 contains "These messages were reported from approximately 2,500 distinct TrustedSource appliances geographically distributed around the world." But what percentage of them were in the US vs outside the US?

Section 3.1.1 contains "For certain ham, 90% of the messages travel about 2,500 miles or less." Yes, because I suspect this paper is looking mostly at US-based receivers, and 2,500 miles mostly covers the US. Rather than geodesic distance, I suspect that (sender country != receiver country) would work as well, especially for smaller countries. I have long claimed that almost all email that crosses a country border is spam.
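The two candidate features can be put side by side. This is only a sketch: it assumes sender and receiver have already been geolocated to a latitude/longitude and a country code, uses the haversine formula on a spherical Earth as a stand-in for the geodesic distance, and the function names are mine, not the paper's.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean radius; a sphere is an approximation


def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to guard against rounding pushing a fractionally above 1.
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(min(1.0, a)))


def crosses_border(sender_country, receiver_country):
    """The simpler feature: does the message cross a country border?"""
    return sender_country.upper() != receiver_country.upper()


# New York to Los Angeles: a coast-to-coast US path stays under 2,500 miles.
ny_to_la = great_circle_miles(40.7128, -74.0060, 34.0522, -118.2437)
under_threshold = ny_to_la < 2500
same_country = not crosses_border("us", "US")
```

For this pair both features agree; they diverge for large countries with distant domestic senders and for small countries with nearby foreign ones.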

Section 6.1 contains "(2) Bots tend to aggregate within ASes, since the machines in the same ASes are likely to have the same vulnerability. It is not easy for spammers to move mail servers or the bot armies to a different AS;". I doubt that was true in 2009, and it is certainly not true today. The vulnerabilities that the bots are targeting (in the OS, in Java, in the browser, in WordPress, etc.) are widespread across the world.

References

[1] Shuang Hao et al., "Detecting Spammers with SNARE: Spatio-temporal Network-level Automatic Reputation Engine", Aug. 2009. https://www.usenix.org/legacy/events/sec09/tech/full_papers/sec09_network.pdf